Abstract

We consider the privacy problem in data publishing: given a relation I containing sensitive information,
"anonymize" it to obtain a view V such that, on one hand, attackers cannot learn any sensitive information
from V, and on the other hand, legitimate users can use V to compute useful statistics on I. These are
conflicting goals. We use a definition of privacy that is derived from existing ones in the literature, which
relates the a priori probability of a given tuple t, Pr(t), with the a posteriori probability, Pr(t|V), and
propose a novel and quite practical definition for utility. Our main result is the following. Denoting by n the
size of I and by m the size of the domain from which I was drawn (i.e. n < m): when the a priori
probability is $\Pr(t) = \Omega(n/\sqrt{m})$ for some tuples t, there exists no useful anonymization algorithm, while
when $\Pr(t) = O(n/m)$ for all tuples t, we give a concrete anonymization algorithm that is both private
and useful. Our algorithm is quite different from the k-anonymization algorithm studied intensively in the
literature, and is based on random deletions and insertions to I.

1 Introduction 

The need to preserve private information while publishing data for statistical processing is a widespread problem.
By studying medical data, consumer data, or insurance data, analysts can often derive very valuable statistical
facts, sometimes benefiting society at large, but concerns about individual privacy prevent the dissemination
of such databases. Today's state of the art in data anonymization is the k-anonymity method [10]: the
privacy (or lack thereof) of k-anonymity has been studied and improved recently [HI [El HI EE EI], but this method
fails to offer any formal guarantees for computing statistical properties, which limits the use of k-anonymized
data.

Clearly, any anonymization method needs to trade off between privacy and utility: removing all items from 
the database achieves perfect privacy, but total uselessness, while publishing the entire data unaltered is at the 
other extreme. In this paper we study the tradeoff between the privacy and the utility of any anonymization 
method, as a function of the attacker's background knowledge. When the attacker has too much knowledge, 
we show that no anonymization method can achieve both. But for practical purposes one can often assume 
that there is a bound on the amount of knowledge the attacker has, and then we show that the tradeoff can be 
achieved by a new, and very simple anonymization algorithm. 



1.1 Background 

To place our work in context we describe here the problem and some of the issues raised by k-anonymization.

Consider a database I of English test scores as shown in Table 1. It would be useful to make some form of
this data publicly available, for example in order to allow researchers in public education to study correlations
between various ages, nationalities, and test scores, without releasing the test scores of individual people. While
names have already been removed from the table, we don't want to publish this data unchanged because an
attacker may use some of the age-nationality values to uniquely identify an individual (this was first shown by
Sweeney [10]): for example, he may know that Joe is 21 years old and of Indian nationality and, since only one
entry in the database matches this age and nationality, the attacker can learn Joe's test score. The problem
is to anonymize the data: more precisely, to compute a view V that perturbs the data to make individual
identifications impossible. However, we need to do this in a way that allows legitimate users to compute answers to
"statistical queries". Examples of such queries are: How many people in the age group 20-30 have score greater

Age   Nationality   Score
25    British       99
27    British       97
21    Indian        82
32    Indian        90
33    American      94
36    American      94

Table 1: A database I of test scores.

than 90? or How many people in a particular country have score less than 90? Users evaluate multiple queries
like these, then combine their answers to derive important correlations between age, score, and nationality: they
don't need to see individual scores.
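The two example queries above can be sketched as counting queries over Table 1. This is only an illustration: the record layout, the helper name `count`, and the choice of "Indian" as the "particular country" are assumptions made for the sketch, not part of the original text.

```python
# Table 1 as a list of records (illustrative encoding of the source table).
TABLE_1 = [
    {"age": 25, "nationality": "British", "score": 99},
    {"age": 27, "nationality": "British", "score": 97},
    {"age": 21, "nationality": "Indian", "score": 82},
    {"age": 32, "nationality": "Indian", "score": 90},
    {"age": 33, "nationality": "American", "score": 94},
    {"age": 36, "nationality": "American", "score": 94},
]


def count(instance, predicate):
    """Counting query: number of records in `instance` that satisfy `predicate`."""
    return sum(1 for row in instance if predicate(row))


# "How many people in the age group 20-30 have score greater than 90?"
q_age_score = count(TABLE_1, lambda r: 20 <= r["age"] <= 30 and r["score"] > 90)

# "How many people in a particular country have score less than 90?"
# (hypothetical choice of country: Indian)
q_country = count(TABLE_1, lambda r: r["nationality"] == "Indian" and r["score"] < 90)

print(q_age_score, q_country)  # prints: 2 1
```

Both queries are answered on the original instance I here; the anonymization problem is to support estimates of such counts from a published view instead.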

The data anonymization problem has been studied intensively in recent years. Virtually all techniques
described in the literature are variations of the k-anonymity method [TUl El [7l [6l [IT], which we illustrate next.
Note that the anonymization problem is different from the query perturbation problem, studied elsewhere [HOG].

Table 2 shows the k-anonymity method and several variations described in the literature, for the instance
I in Table 1. k-anonymization is based on generalizing attribute values, e.g. replacing age 27 with an interval
21-30, or even with a wildcard *. Table 2(a) shows the original idea in k-anonymization [5]: generalize the
attribute values such that every tuple occurs in the data at least k times, in our case k = 2. Our hypothetical
attacker is no longer able to learn Joe's score, since now there are 2 tuples that match Joe's age and nationality
(in general: k tuples). But by generalizing the score values one arguably reduces the utility of the data, and for
that reason researchers have proposed to differentiate between the sensitive attribute (the score in our example)
and the quasi-identifiers (age, nationality in our case), and to anonymize only the latter: this is illustrated in our
example in Table 2(b), which is still a 2-anonymous table but on the quasi-identifiers only, while the scores are
unchanged. One problem with this approach (which is not addressed in the literature) is that it is not always
obvious how to classify attributes into quasi-identifiers and sensitive attributes, and this is presumably left to the
application: for that reason, in our paper we do not require such a classification of attributes. Continuing our
illustration, observe that by keeping the sensitive attribute unaltered one runs the risk of major privacy breaches
(as first noted in [8]). This is illustrated in Table 2(b) in the third group: all scores in this group are 94, hence
an attacker who knows that (say) Jane is an American can learn that her test score is 94. The solution proposed
in [8] is to further require the anonymized data to satisfy a condition called ℓ-diversity, namely that in each group
there be at least ℓ distinct values of the sensitive attribute. This is shown in Table 2(c), which is 3-anonymous
and 2-diverse: within each group, the attacker finds at least 2 (in general ℓ) scores that could potentially belong
to a particular individual, hence, arguably he cannot guess a private piece of data with probability more than
1/ℓ.
Thus, most of the prior research on data anonymization has focused on understanding and improving the
data's privacy. The formal definitions of privacy in the literature rely on comparing the a priori and the a
posteriori probability of a sensitive tuple t being in the database [8]. The a priori probability Pr[t] is the
attacker's belief before seeing the published data: for example, the attacker may believe that Joe, who is 21 years
old and Indian, had a test score of 82 with probability 5%. The a posteriori probability, Pr[t | V], is the belief
after seeing the published view: for example, after seeing the view in Table 2(b) the a posteriori probability of Joe
having a test score of 82 is 50%. When these two probabilities are close, or when the latter is low, then privacy
is said to be preserved. In this paper we use a formal definition of privacy that follows the same principles.

Much less understood is the data's utility. Consider the following query: compute the number of individuals 
between 26-32 years that received a score > 90: it is unclear how to estimate the query's answer from any of these 
three tables, and what guarantees this estimate offers. A definition of utility based on entropy is given in [6], 
but this compares the a priori and the a posteriori distribution, and does not give any guarantees on estimating 
the answer to a counting query. 

[Table 2 body not recoverable; surviving panel titles: (c) 2-diversity and 3-anonymity; (d) (d, γ)-privacy]

Table 2: Different methods for computing an anonymized view V for the instance I in Table 1.

1.2 Our Contributions 

In this paper we study the privacy/utility tradeoff. For that we give a definition of utility based on estimating 
counting queries, i.e. queries of the form count the number of records satisfying a certain predicate. Our definition 
requires that all counting queries whose answer is "large" be accurately estimated from the anonymized data, 
with a guarantee on the accuracy that does not depend on any additional assumptions on the original data. For 
example, if the query count the number of persons with age between 26 and 32 who received a score over 90 has
a large answer (say, 1/50th of the size of the instance), then we require it to be possible to estimate it with
high accuracy from the anonymized data. None of the variations of k-anonymity described earlier allows us to
compute a good estimate to this query: for example in Table 2(c) all six entries could have an age between 26
and 32, and the only way to estimate the query's answer is to make further assumptions about the distribution
of the data: e.g. that ages are uniformly distributed in the interval 21-40, and that the distribution on ages is
independent of the test score. But if we ask users to know such properties about the data then we are defeating
their very purpose in analyzing the data: the users would like to discover whether age and the test score are 
independent or not, and not to be required to know this in order to discover other facts. 

Note that our definition of utility applies only to queries with large answers. This is crucial: queries with 
small answers could leak privacy if we allow them to be estimated accurately. For example the query count the 
number of Indian persons, of age 21, who received a score of 82 has answer 1 on the original data, and if we 
allowed it to be estimated with high accuracy then privacy would be breached. 

Our main result is an almost complete separation of the case when a private/useful anonymization algorithm
is possible from that when it is impossible, based on the attacker's knowledge. If the attacker's prior tuple
probability is as high as $\Omega(n/\sqrt{m})$, where n is the size of the database and m the size of the domain D, then
a utility-preserving anonymization algorithm is impossible. This impossibility result holds even when the prior
distribution is tuple independent. If the attacker's prior is $O(n/m)$, then privacy/utility is possible.
We prove that by giving a new and very simple anonymization algorithm: randomly remove each tuple in I
with probability $\alpha$, and randomly insert each tuple in the domain D with probability $\beta$. A particular run of
this algorithm is shown in Table 2(d): two tuples have been removed from the original data, and eight new
tuples have been inserted. (For presentation purposes we indicate which four tuples come from I and which were
inserted randomly: in practice this separation is hidden.) Both privacy and utility can be achieved by tuning $\alpha$
and $\beta$, as long as the attacker's prior is $O(n/m)$. Importantly, the accuracy of any counting query is guaranteed
as a probability on the random choices made by the algorithm, without any assumptions on the original data.
The proof of our algorithm's privacy assumes that the prior distribution is tuple independent, or that it satisfies
some very limited form of correlations. This is clearly a weakness, but in some sense it is unavoidable: we show
that if the attacker is allowed to know arbitrary tuple correlations, then no utility-preserving algorithm exists
even when the prior is $\Omega(n/m)$.

Our impossibility result extends a similar result by Dwork and Nissim [3], which was derived for a different
privacy setting. We give the formal definition of that privacy setting in Sec. 3.1. Using this notion of privacy,
they showed that no such algorithm can be useful. Here we improve the impossibility result (by tightening
the bounds, and relaxing the definition of counting queries), and show how to extend it to the more traditional
notion of privacy.

Random deletions and insertions in the database are well known privacy preserving techniques, which have 
been used in a variety of settings, e.g. in private data mining [5]. Perhaps surprisingly, our algorithm is the first 
application of these techniques to data publishing. 

Overview Privacy and utility are defined in Sec. 2, the negative (impossibility) result is in Sec. 3, and the
positive result (the new algorithm) is in Sec. 4. Extensions to tuple correlations are discussed in Sec. 5.

2 Notations and Definitions 

Let $I = \{t_1, t_2, \ldots, t_n\}$ be a database instance where each tuple $t_i$ is drawn from a domain D of size m (i.e. $I \subseteq D$).
We need to design a privacy preserving algorithm A which takes as input an instance I and publishes a view
V, of the same schema as I (i.e. $V \subseteq D$). The view V becomes public knowledge while the original instance I
remains hidden.

Modeling the Adversary We model the adversary's background knowledge using a probability distribution
$\Pr_1$ over the set of all possible database instances I. Formally, $\Pr_1 : 2^D \to [0, 1]$ is such that $\sum_{I \subseteq D} \Pr_1[I] = 1$.
For each tuple $t \in D$ we denote by $\Pr_1[t]$ the marginal distribution (i.e. $\Pr_1[t] = \sum_{I : t \in I} \Pr_1[I]$), and call $\Pr_1$
tuple-independent when these are independent events. Clearly, we do not know $\Pr_1$, only the attacker knows it, and we
should design our privacy algorithm assuming the worst about $\Pr_1$. As we shall see, however, it is impossible to
achieve privacy and utility when the attacker is all powerful, hence we will make reasonable assumptions about
$\Pr_1$ below.

Modeling the Algorithm Given an instance I, a privacy-preserving algorithm A will make some random
choices and compute a view V. We denote by $\Pr_2^I$ the probability distribution on the algorithm's outputs, i.e.
$\Pr_2^I : 2^D \to [0, 1]$ is s.t. $\sum_{V \subseteq D} \Pr_2^I[V] = 1$. Unlike $\Pr_1$, we have total control over $\Pr_2$, since we design the
algorithm A.

As the algorithm A needs to be publicly known, an adversary can compute the induced probability $\Pr_{12}[t \mid V]$
based on his prior distribution $\Pr_1$ and the algorithm's distribution $\Pr_2$, namely, by Bayes' rule,
$\Pr_{12}[t \mid V] = \sum_{I \subseteq D : t \in I} \Pr_1[I]\,\Pr_2^I[V] \,/\, \sum_{I \subseteq D} \Pr_1[I]\,\Pr_2^I[V]$.
We call $\Pr_1[t]$ the prior probability of tuple t, while $\Pr_{12}[t \mid V]$ is called the posterior probability of t conditioned
on view V. Throughout this paper we assume that the adversary is computationally powerful and that $\Pr_{12}[t \mid V]$
can always be computed by the adversary irrespective of the computation effort required.
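To make the posterior concrete, the following minimal sketch computes it by brute force over all instances of a tiny domain, using Bayes' rule with a tuple-independent prior. Everything named here is an assumption for illustration: the helpers `all_subsets`, `independent_prior`, and `posterior`, the three-tuple domain, and the toy mechanism `keep_half` (keep each tuple of I with probability 1/2, insert nothing), which is not the algorithm of this paper.

```python
from itertools import combinations


def all_subsets(domain):
    """Yield every instance I that is a subset of the domain D."""
    items = sorted(domain)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def independent_prior(marginals):
    """Build Pr_1[I] for a tuple-independent prior with the given marginals Pr_1[t]."""
    def prior(instance):
        prob = 1.0
        for t, p in marginals.items():
            prob *= p if t in instance else 1.0 - p
        return prob
    return prior


def posterior(t, view, domain, prior, view_prob):
    """Posterior of tuple t given view V, summing Pr_1[I] * Pr_2^I[V] over all I."""
    num = 0.0
    den = 0.0
    for inst in all_subsets(domain):
        joint = prior(inst) * view_prob(inst, view)
        den += joint
        if t in inst:
            num += joint
    return num / den


def keep_half(instance, view):
    """Toy mechanism (hypothetical): Pr_2^I[V] when each tuple of I is kept w.p. 1/2."""
    if not view <= instance:
        return 0.0
    return 0.5 ** len(instance)


domain = {"a", "b", "c"}
prior = independent_prior({"a": 0.2, "b": 0.5, "c": 0.1})
view = frozenset({"b"})
post_a = posterior("a", view, domain, prior, keep_half)
post_b = posterior("b", view, domain, prior, keep_half)
print(round(post_a, 4), post_b)
```

The enumeration is exponential in the domain size, so it only serves to illustrate the definition; the text's assumption is that the adversary can carry out this computation regardless of cost.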

2.1 Classification of Adversaries 

Dwork's impossibility result [3] suggests that it is impossible to anonymize data such that it is both useful and
protects against arbitrary adversaries. If taken literally, it seems to say that anonymization is hopeless. However, 
there are lots of examples in practice when data is being anonymized and published under risks considered to 
be acceptable, and this indicates that the privacy definition used by Dwork is too general. 

We propose here to study adversaries with restricted power, by placing a bound on the adversary's prior knowledge,
and will show that for certain values of this bound privacy is possible.

Specifically, we restrict the adversaries by assuming that for every tuple t the prior probability $\Pr_1[t]$ is either
small, or equal to 1. In the first case we will seek to hide the tuple from the adversary; in the second case there's
nothing to do, since the adversary already knows t. Another way to look at this is that we require the algorithm
to preserve the privacy for tuples for which $\Pr_1[t]$ is small: if $\Pr_1[t]$ is large, we may well assume that it is 1, in
essence giving up any attempt to hide t from the adversary.

Definition 2.1. Let $d \in (0, 1)$. A d-bounded adversary is one for which $\forall t \in D$, either $\Pr_1[t] \le d$ or $\Pr_1[t] = 1$.

We also consider a sub-class of d-bounded adversaries called the d-independent adversary, which further
requires $\Pr_1$ to be tuple-independent:

Definition 2.2. A d-independent adversary is a d-bounded, tuple-independent adversary.

2.2 Privacy definition and motivation

Our definition of privacy is based on comparing the prior probability $\Pr_1[t]$ with the a posteriori probability
$\Pr_{12}[t \mid V]$, as is standard in the literature. We consider here two definitions, one relative and one absolute, and
will prove that, in some sense, they are equivalent.

Definition 2.3. An algorithm is called $(d, \delta)$-relative-private if the following holds for all d-independent
adversaries $\Pr_1$, views V, and tuples t:

$$e^{-\delta} \le \frac{\Pr_{12}[t \mid V]}{\Pr_1[t]} \le e^{\delta}$$

Denoting $d e^{\delta} = \gamma$, the definition can be written as:

$$\frac{d}{\gamma} \le \frac{\Pr_{12}[t \mid V]}{\Pr_1[t]} \le \frac{\gamma}{d}$$

Moreover, if $\Pr_1[t] \le d$ then $\Pr_{12}[t \mid V] \le \gamma$, which justifies the following definition of absolute privacy¹:

Definition 2.4. An algorithm is called $(d, \gamma)$-private if the following holds for all d-independent adversaries $\Pr_1$,
views V, and tuples t s.t. $\Pr_1[t] \le d$:

$$\frac{d}{\gamma} \le \frac{\Pr_{12}[t \mid V]}{\Pr_1[t]} \quad \text{and} \quad \Pr_{12}[t \mid V] \le \gamma$$

If tuple t fails the left (or right) inequality then we say there has been a negative (or positive) leakage,
respectively. Note that positive leakage resembles the notion of positive $(\rho_1, \rho_2)$ breach as described in [5]. We
prove the following in the Appendix:

Proposition 2.5. Every $(d, \delta)$-relative-private algorithm is $(d, \gamma)$-private for $\gamma = d e^{\delta}$. Conversely, every
$(d, \gamma)$-private algorithm is $(d, \delta)$-relative-private for

2.3 Utility



As explained earlier, we are interested in supporting counting queries over some predicates. More generally, we
define a query Q to be any subset of the domain D, i.e. $Q \subseteq D$. The result of the query over the instance I is
simply $|Q \cap I|$, which we denote as Q(I). For example, the query count the number of tuples with age between
26 and 31 and score > 91 is expressed as Q(I), where Q denotes all tuples in the domain having age in [26, 31]
and score > 91.

For any algorithm A, using the knowledge of how the algorithm publishes its view one can obtain an estimate
for Q(I). Let EST(Q, V) be the estimate² of Q(I) as obtained from the published view V. Additionally let
$|Q(I) - EST(Q, V)|$ be the absolute error for a query Q. We require the algorithm to be such that EST(Q, V)
provides a good approximation for Q(I).



1 Only the positive leakage is absolute. For the negative leakage an absolute definition makes no sense. 
2 We assume in this paper that EST is deterministic. 

Definition 2.6. A randomized algorithm A is called useful if there exists an estimator EST(Q, V) s.t. for any
$\epsilon > 0$ there exists a constant $\rho$ such that for all domain sizes m, database sizes n, and queries Q:

$$\Pr_2\left[\, |Q(I) - EST(Q, V)| \ge \rho\sqrt{n} \,\right] \le \epsilon$$

2.4 Discussion 

Our working definition of privacy is $(d, \gamma)$-privacy, where d intuitively corresponds to the prior, and $\gamma$ to the
posterior. An intuitive reference point for d is d = n/m, since the expected size of I is n for the independent
distribution where $\Pr_1[t] = n/m$ for all t. $\gamma$ is larger, and should be measured in absolute values. For example,
$\gamma = 0.05$ means that the view V leaks no tuple t with more than 5% probability.

The motivation for our definition of utility is to give a guarantee on the absolute error of the estimator. It is
stated as $\forall \epsilon. \exists \rho$ rather than $\forall \rho. \exists \epsilon$ because the latter is trivially satisfied by any estimator (simply choose $\epsilon = 1$).
The absolute error is expressed as $\rho\sqrt{n}$ for two reasons. On one hand we do not want small errors: if one could
estimate Q(I) with an error < 1 then there would be privacy breaches, as we have seen. On the other hand we do not
want big errors: any estimator is accurate with probability 1 if the absolute error is n. In between 1 and n we
have chosen $\rho\sqrt{n}$.

k-anonymity and ℓ-diversity satisfy neither $(d, \gamma)$-privacy nor usefulness. Considering a d-independent
adversary, privacy is compromised when the adversary knows that some tuples have very low probability. To see
the intuition, consider the instance in Table 2(c). Suppose a d-independent adversary is trying to find out Joe's
test score, and he knows that Joe is likely to have a low score: i.e. the prior is such that $\Pr_1[t]$ is very low for
a tuple saying that Joe's score is greater than 95, and larger (but still $\le d$) for tuples saying that his score is less
than 90. If the adversary knows that Joe's age is less than 30, then his record is among the first three: since
the first two tuples have a very low probability (as their scores are 99, 97), the adversary concludes that Joe's
score is 82 with very high probability. There is no utility either. Suppose we want to estimate the number of
students between 29 and 31 years old from Table 2(c): the answer can be anywhere between 0 and 6, and, if our
estimate is the average, 3, then the only way we can guarantee any accuracy is by making assumptions on the
distribution of the data, in essence by knowing $\Pr_1$.

By contrast, the algorithm described in Sec. 4 takes two constants $k, \gamma$ and is both $(k\frac{n}{m}, \gamma)$-private and also

useful: for any e, it gives p = 4.y/3~in(~). 

3 Impossibility Results 

We prove here that no $(d, \gamma)$-private algorithm can provide even a weak notion of utility if $d = \Omega(n/\sqrt{m})$. For
that we first establish an impossibility result for a weaker notion of privacy, called $\epsilon$-indistinguishability, and a
very weak notion of utility, called meaningfulness: in this form, our result is an improvement of the impossibility
result in [4]. Then we establish an impossibility result for our notions of privacy and utility.

3.1 The Strong Impossibility Result 

Consider the following alternative definition of privacy [1]:

Definition 3.1. An algorithm is $\epsilon$-indistinguishable if for all database instances I and I' which disagree exactly
over a pair of tuples (i.e. $|I| = |I'|$ and $|I - I'| = 2$) and for all views V,

$$e^{-\epsilon} \le \frac{\Pr_2^I[V]}{\Pr_2^{I'}[V]} \le e^{\epsilon}$$

The weak notion of utility, which we call meaningfulness, was first considered in [4] and is based on the notion
of statistical difference:

Definition 3.2. The statistical difference between two distributions $\Pr_A$ and $\Pr_B$ over the domain X is
$SD(\Pr_A, \Pr_B) = \sum_{x \in X} |\Pr_A(x) - \Pr_B(x)|$.
Note that $SD(\Pr_A, \Pr_B) \in [0, 2]$: it is 0 when $\Pr_A = \Pr_B$, and is 2 when $\forall x.(\Pr_A(x) = 0 \vee \Pr_B(x) = 0)$.
We explain now informally the connection between statistical difference and utility (the formal connection is in
Prop. 3.6 below). Suppose an algorithm A gives reasonable estimates to counting queries Q. Let Q be a "large"
query, i.e. if executed on the entire domain it returns a sizeable fraction, say 1/5. When we estimate Q on the
view published for a particular instance I we will get some errors, which depend on the actual size n = |I|. If there
is any utility to A then a user should be able to distinguish between the two extreme cases, when Q(I) = n and
when Q(I) = 0. To capture this intuition, define the uniform distribution $\Pr_1$: $\forall I$ s.t. $|I| = n$, $\Pr_1[I] = 1/\binom{m}{n}$.

Note that $\binom{m}{n}$ is the number of instances of size n. Thus $\Pr_1$ makes every instance of size n equally likely. Let
$E_Q$ be the event $(|I| = n \wedge Q(I) = n)$, and $E'_Q$ the event $(|I| = n \wedge Q(I) = 0)$. Then we expect to be able to
differentiate between the following two distributions: $\Pr_A^Q = \Pr_{12}[V \mid E_Q]$ and $\Pr_B^Q = \Pr_{12}[V \mid E'_Q]$. To obtain a
reasonable estimate for Q, $SD(\Pr_A^Q, \Pr_B^Q)$ should be large. On the other hand, if $SD(\Pr_A^Q, \Pr_B^Q)$ is very small
then no reasonable estimate of the query can be obtained from any of the published views. An algorithm is
meaningless if $SD(\Pr_A^Q, \Pr_B^Q)$ is small for a large fraction of the queries Q. An algorithm is meaningful if it
is not meaningless.
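The statistical difference of Definition 3.2 is easy to compute for finite distributions. The sketch below represents a distribution as a dictionary from outcomes to probabilities (an assumption made for illustration; the function name `statistical_difference` is likewise hypothetical) and checks the two extreme values mentioned in the text.

```python
def statistical_difference(pr_a, pr_b):
    """Sum of |Pr_A(x) - Pr_B(x)| over the union of both supports.

    Outcomes missing from a dictionary are treated as having probability 0.
    """
    support = set(pr_a) | set(pr_b)
    return sum(abs(pr_a.get(x, 0.0) - pr_b.get(x, 0.0)) for x in support)


same = statistical_difference({"v1": 0.5, "v2": 0.5}, {"v1": 0.5, "v2": 0.5})
disjoint = statistical_difference({"v1": 1.0}, {"v2": 1.0})
partial = statistical_difference({"v1": 0.75, "v2": 0.25}, {"v1": 0.25, "v2": 0.75})
print(same, disjoint, partial)  # prints: 0.0 2.0 1.0
```

Note that this definition omits the factor 1/2 used in the usual total variation distance, which is why the range is [0, 2] rather than [0, 1].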

Definition 3.3. Let f < 1 be a constant independent of the domain size m and database size n. Consider all
queries Q s.t. $\frac{1}{2}(1 - f) \le \frac{|Q|}{m} \le \frac{1}{2}(1 + f)$. An algorithm A is called meaningless if $SD(\Pr_A^Q, \Pr_B^Q)$ is smaller than
1/2 for a fraction 2/3 of queries Q.

Next we state the strong impossibility result using meaningfulness as the definition of utility. For the sake of 
concreteness we use the constant 1/2 for statistical difference and 2/3 for the fraction of queries in the definition 
of meaningfulness. However, the impossibility result works for arbitrary constants. 

Theorem 3.4 (Strong Impossibility). There exists a constant c independent of m and n such that every algorithm 
which satisfies e-indistinguishability with e £ < is meaningless. 

Before giving the proof, we comment on how this result extends a similar result in [4]. First, we have 
generalized it to a larger class of queries: the previous result restricts the class of selection queries to certain 
xor operations, here we allow arbitrary selection queries. Secondly, we improve the bound on the statistical 
difference. This was possible because the original proof relies on a chaining argument which provides a bound on 
the statistical difference of a function at each step of the chain to eventually compute the statistical difference of 
$\Pr_A$ and $\Pr_B$. At each step it considers tuples as points in a high dimensional space and bounds the statistical
difference of a function which satisfies e-indistinguishability over tuples. We observe that each database instance 
can be thought of as a point in a higher dimensional space and thus bound the statistical difference of a function 
which satisfies e-indistinguishability over instances. 

Proof of Theorem 3.4. Consider all instances I such that |I| = n. The number of such instances with all
distinct tuples is $\binom{m}{n}$. Moreover, as A satisfies $\epsilon$-indistinguishability, for any pair of instances I and I' where
$|I| = |I'| = n$ and $|I - I'| = 2$,

$$e^{-\epsilon} \le \frac{\Pr_2^I[V]}{\Pr_2^{I'}[V]} \le e^{\epsilon}$$

We denote by $\Pr_{12}[V]$ the probability that the view V is published. If $\Pr_1$ is the uniform distribution then

$$\Pr_{12}[V] = \frac{1}{\binom{m}{n}} \sum_{|I| = n} \Pr_2[V \mid I]$$

Consider a query Q s.t. $|Q| = mr$ where $\frac{1}{2}(1 - f) \le r \le \frac{1}{2}(1 + f)$. Let us represent the set of instances for
which Q(I) = n as $S_Q$. Then $|S_Q|$ is $\binom{mr}{n}$. Let $E_Q$ be the event that the input database instance I belongs to $S_Q$.
$\Pr_{12}[V \mid E_Q]$ is the probability that the view V is published if an instance I is picked uniformly at random from
the set $S_Q$.

$$\Pr_{12}[V \mid E_Q] = \frac{1}{|S_Q|} \sum_{I \in S_Q} \Pr_2[V \mid I]$$

If we consider all queries Q s.t. $|Q| = mr$, then we see that the expectation $E_Q(\Pr_{12}[V \mid E_Q]) = \Pr_{12}[V]$. We
show in Lemma \B . 1 1 that if e e < m ^~ a 2 " then the variance of Pr[V|i?Q] over Q of fixed size mr is small and is 
less than ^^W 2 " 2 Uging ^ 

we can show that the statistical difference between the distributions i"V 12 [F] 
and Pti2\V\Eq\ is small with high probability over choice of Q. As shown in Lemma fB.3[ with probability at 
least 1 — a, 

SD(Pr 1 2[V],Pri 2 [V\E Q })=0 

Fix a = i. As r > ^(1 + /) and / is a constant independent of n and m, with probability greater than 

i 

§, SD(Pr 12 [V], Pr 12 [V\E Q ]) is O M^M*. The same holds for the statistical difference between Pr 12 [V] and 
Pri2\V\E'o\. Thus with probability greater than | over the choice of Q, 

SD(Pr 12 [V\E Q ],Pr 12 [V\E' Q ]) = 

Thus there exists a constant c such that if e £ < then SD(Pr 12 [V\E Q ], Pr 12 [V\E' Q ]) < \ for at least 2/3 of 
the queries Q making the algorithm meaningless. This completes the proof. 

3.2 Impossibility for $(d, \gamma)$-Privacy

We show now how the strong impossibility result translates to our notions of privacy/utility. First, we show that
$(d, \delta)$-relative-privacy implies $\epsilon$-indistinguishability. Note that the former is a privacy notion about an adversary,
while the latter does not talk about any adversary. The proof of the following proposition is in the appendix.

Proposition 3.5. Every $(d, \delta)$-relative-private algorithm satisfies $\epsilon$-indistinguishability with $\epsilon = 2\delta + 2\ln(2)$.

Next, we connect the two notions of utility: 

Proposition 3.6. Any useful algorithm is also meaningful. 

Proof: Consider the distributions $\Pr_A^Q$, $\Pr_B^Q$ defined above, for the events $E_Q = (Q(I) = n) \wedge (|I| = n)$ and
$E'_Q = (Q(I) = 0) \wedge (|I| = n)$. Since the algorithm is useful, we choose the value $\epsilon = \frac{1}{4}$ in Definition 2.6 and
obtain a value for $\rho$. Let $\mathcal{I}_0$ be the set of instances I such that Q(I) = 0 and |I| = n. Similarly, let $\mathcal{I}_n$ be the
set of instances I' such that Q(I') = n and |I'| = n. Let $\mathcal{V}_0$ be the set of views V such that $EST(Q, V) < \rho\sqrt{n}$.
Similarly, let $\mathcal{V}_n$ be the set of views V such that $EST(Q, V) > n - \rho\sqrt{n}$.

As A is useful, a view $V \in \mathcal{V}_0$ is published on any $I \in \mathcal{I}_0$ with probability at least $\frac{3}{4}$. Thus,
$\Pr_{12}[V \in \mathcal{V}_0 \mid E'_Q] \ge \frac{3}{4}$. Similarly, $\Pr_{12}[V \in \mathcal{V}_n \mid E_Q] \ge \frac{3}{4}$.

If $n > 2\rho\sqrt{n}$ then $\mathcal{V}_0$ and $\mathcal{V}_n$ have to be disjoint. As $\rho$ is independent of n, we can choose a large enough n
such that $\sqrt{n} > 2\rho$. For those values of n, the statistical difference between $\Pr_A^Q$ and $\Pr_B^Q$ is at least $2(\frac{3}{4} - \frac{1}{4}) = 1$,
for every Q. Thus any useful algorithm is also meaningful.



Corollary 3.7. Let A be a meaningful algorithm and let $\gamma < 1$ be any given number. Then there exists a

y n 
-7 y/m 



constant c independent of m,n and 7 such that there is a d-independent adversary with d = \ yz — 7= f or which 



one of the following is true for some tuple t 

• There is positive leakage: $\Pr_1[t] \le d$ but $\Pr_{12}[t \mid V] > \gamma$

• There is negative leakage: — < \-j= 
Proof: We can define for any algorithm 

Pr 2 [V\I] 



max 



/,/' Pr 2 [V\P] 
for all I and I' which disagree in a single pair of tuples. By Theorem 3.4 the algorithm can have meaningful utility
only if e e = f2(-^). Using Proposition 13 . 51 we can infer that if the algorithm is meaningful, it cannot satisfy 

(d, (5)-relative-privacy unless e s > c^-, for some constant c independent of n and m. Thus, no meaningful 

algorithm can satisfy (d, 7) -privacy unless — c ^~- ^ s ^ * s meaningful, at least one of the following 

statements is true: 

• There is positive leakage: Pri[ti] < d but Pri2[£i|U] > , i 1 „ . 
In particular if d = iy^-2= then PrL 2 [i 4 |U] > 7 

• There is negative leakage: Pr u [t ; ]V] <^<^i^<!^= 

to to Pt-lIU] — 7 — 7 l-d — c JTn 

This completes the proof. 



4 Algorithm 

As we have seen, it is impossible to guarantee absence of privacy breaches against all d-independent adversaries if
$d = \Omega(n/\sqrt{m})$. In this section, we present a simple algorithm which is $(d, \gamma)$-private. Here, d and $\gamma$ are parameters
given to the algorithm. The utility of the algorithm depends on the values of d and $\gamma$, and a guarantee for the
utility can be made only for $d = k\frac{n}{m}$, for any fixed constant k (hence $d = O(n/m)$).

Assume that the input database instance is $I \subseteq D$. The insert-remove algorithm computes the view V as
follows:

• For every tuple in I, insert it in the view V independently with probability $\alpha$

• For every tuple in D which is not in I, insert it in the view V independently with probability $\beta$

• Publish D, V, $\alpha$, $\beta$.
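The steps above can be sketched directly in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `insert_remove`, the representation of tuples as hashable values, and the optional `rng` argument are all choices made for the sketch.

```python
import random


def insert_remove(instance, domain, alpha, beta, rng=None):
    """Return a view V of `instance` over `domain`.

    Each tuple of the instance is placed in V independently with probability
    alpha; each other tuple of the domain is placed in V independently with
    probability beta.
    """
    rng = rng or random.Random()
    view = set()
    for t in domain:
        include_prob = alpha if t in instance else beta
        # rng.random() lies in [0.0, 1.0), so probability 1 always includes
        # and probability 0 never includes.
        if rng.random() < include_prob:
            view.add(t)
    return view


# Usage with a toy integer domain (illustrative values only).
domain = range(20)
instance = {2, 3, 5, 7, 11}
view = insert_remove(instance, domain, alpha=0.5, beta=0.1, rng=random.Random(0))
print(sorted(view))
```

This sketch scans the whole domain, which costs time proportional to m; for large domains one would presumably sample the inserted tuples rather than flip a coin per domain element.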



4.1 Privacy Analysis 

Theorem 4.1. The insert-remove algorithm is $(d, \gamma)$-private where $d < \gamma$ if we choose $\alpha \le 1 - \frac{d}{\gamma}$ and $\beta \ge$

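Proof: Consider any tuple t. Let $\Pr[t] = p \le d$. We know that

$$\Pr[t \mid V] = \frac{\Pr[V \mid t]\,\Pr[t]}{\Pr[V \mid t]\,\Pr[t] + \Pr[V \mid \bar{t}]\,\Pr[\bar{t}]} = \frac{\Pr[V \mid t]\,p}{\Pr[V \mid t]\,p + \Pr[V \mid \bar{t}]\,(1 - p)}$$
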
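If $d < \gamma$ then the parameters can be chosen with $\beta \le \alpha$. Now consider the following two cases:

• t is present in the published view. Thus $\Pr[V \mid t] = \alpha$ and $\Pr[V \mid \bar{t}] = \beta$. Using this we get,

$$\Pr[t \mid V] = \frac{\alpha p}{\alpha p + \beta(1 - p)} \le \frac{\alpha d}{\alpha d + \beta(1 - d)} \le \gamma$$

where the first inequality holds because the expression is increasing in p, and the second follows from the lower bound on $\beta$.
On the other hand, for negative leakage, note that $\frac{\Pr[t \mid V]}{p} = \frac{\alpha}{\alpha p + \beta(1 - p)} \ge 1 \ge \frac{d}{\gamma}$, since $\beta \le \alpha$.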

t is not present in the published view. Thus JV[V|f] = 1 — a and Pr|V |f j] = 1-/3. Using this we get, 

(1 - a)p 



Pr[t\V] 



(l-a)p+(l-[3)(l-p) 
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As a > (3, we get | > \^ and 



(1 — a)p ap 



(l-a)p+(l-(3)(l-p) ~ ap + 0(l-p) 

As ap+i 3(i_p) is an increasing function of ^ for a fixed p. Thus even in this case we have Pr[t|V] < 7. 
Additionally, > {^f} > 1 - a > f 
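The two cases of the proof can be checked numerically. The sketch below assumes the reading of Theorem 4.1 in which $\beta \ge \frac{d}{\gamma}\cdot\frac{1-\gamma}{1-d}\,\alpha$ and $\alpha \le 1 - \frac{d}{\gamma}$; the helper names `posterior` and `is_private` are hypothetical and a grid search is of course not a proof.

```python
def posterior(p, alpha, beta, in_view):
    """Posterior Pr[t | V] for a tuple with prior p under insert-remove."""
    like_t = alpha if in_view else 1.0 - alpha      # Pr[V | t]
    like_not_t = beta if in_view else 1.0 - beta    # Pr[V | not t]
    return like_t * p / (like_t * p + like_not_t * (1.0 - p))

def is_private(d, gamma, alpha, beta, steps=1000):
    """Check both leakage bounds on a grid of priors 0 < p <= d."""
    for i in range(1, steps + 1):
        p = d * i / steps
        for in_view in (True, False):
            post = posterior(p, alpha, beta, in_view)
            positive_leak = post > gamma + 1e-12
            negative_leak = post / p < d / gamma - 1e-12
            if positive_leak or negative_leak:
                return False
    return True
```

Choosing $\beta$ exactly at the lower bound makes the posterior of a tuple with prior $d$ that appears in the view equal to $\gamma$, which is why the check uses a small tolerance.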

4.2 Utility Analysis 

The estimator $EST(Q, V)$ is the following. Recall that $\alpha$ and $\beta$ are published.

• Let $n_V = Q(V)$; that is, $n_V = |Q \cap V|$ is the query evaluated on the view $V$.

• Let $n_D = Q(D)$; that is, $n_D = |Q|$ is the query evaluated on the entire domain (we explain below how to do this efficiently).

• Define:

\[ EST(Q, V) = \frac{n_V - \beta\, n_D}{\alpha - \beta} \]
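A minimal sketch of the estimator, assuming the form $EST(Q,V) = (n_V - \beta n_D)/(\alpha - \beta)$ obtained by inverting the expected view count; the function name `estimate` is hypothetical.

```python
def estimate(n_view, n_domain, alpha, beta):
    """Estimate Q(I) from the count on the view and the count on the domain."""
    if alpha == beta:
        raise ValueError("alpha and beta must differ")
    return (n_view - beta * n_domain) / (alpha - beta)
```

If the observed count equals its expectation $\alpha Q(I) + \beta(n_D - Q(I))$, the estimator returns $Q(I)$ exactly; the deviation of $n_V$ from that expectation is what the utility theorem bounds.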

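Theorem 4.2. Let $r$ be a constant and $\alpha$, $\beta$ be such that $\alpha \ge \frac{1}{2}$ and $\beta \le \frac{r}{4}\cdot\frac{n}{m}$. Then for any $\epsilon > 0$, denoting $\rho = 2\sqrt{3r\ln(2/\epsilon)}$, we have:

\[ \Pr_2\left[\, |Q(I) - EST(Q, V)| \ge \rho\sqrt{n} \,\right] \le \epsilon \]

It follows that the algorithm is useful for $\alpha \ge \frac{1}{2}$ and $\beta \le \frac{r}{4}\cdot\frac{n}{m}$.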

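Proof: Let $\mu_V$ be the expected number of tuples in $V$ which satisfy the query $Q$. Then,

\[ \mu_V = \alpha\, Q(I) + \beta\,(n_D - Q(I)) \]

which reduces to

\[ Q(I) = \frac{\mu_V - \beta\, n_D}{\alpha - \beta} \]

Instead of using $\mu_V$, we use $n_V$ and obtain $EST(Q, V)$. Thus, the absolute error $\mathcal{E} = |Q(I) - EST(Q, V)|$ is

\[ \mathcal{E} = \frac{|\mu_V - n_V|}{\alpha - \beta} \]

Using the Chernoff bound we can say that with probability at least $1 - 2e^{-\delta^2\mu_V/3}$, the fractional error satisfies $\frac{|\mu_V - n_V|}{\mu_V} \le \delta$. Thus with probability at least $1 - 2e^{-\delta^2\mu_V/3}$, we know that $\mathcal{E} \le \frac{\delta\mu_V}{\alpha - \beta}$. Hence, for $\delta = \sqrt{\frac{3\ln(2/\epsilon)}{\mu_V}}$, with probability greater than $1 - \epsilon$ the following holds:

\[ \mathcal{E} \le \frac{\sqrt{3\mu_V\ln(2/\epsilon)}}{\alpha - \beta} \]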



As $\alpha = \frac{1}{2}$ and $\beta \le \frac{r}{4}\cdot\frac{n}{m}$, we can guarantee that the error $\mathcal{E}$ would be less than $2\sqrt{3r\ln(2/\epsilon)\,n}$ with probability greater than $1 - \epsilon$. Since $\epsilon$ was arbitrary, the algorithm is useful. This completes the proof.

Figure 1: View Size Ratio vs $\gamma$

Assuming that $d \le \frac{\gamma}{2}$, for satisfying the trade off between privacy and utility we can choose $\alpha = \frac{1}{2}$ and a suitable $\beta$, if

\[ \frac{d}{\gamma}\cdot\frac{1-\gamma}{1-d}\,\alpha \le \frac{r}{4}\cdot\frac{n}{m} \]

This is true when $\frac{d}{\gamma} \le \frac{r}{4}\cdot\frac{n}{m}$. Thus for $d = k\frac{n}{m}$, the algorithm satisfies $(d, \gamma)$-privacy and is also useful, as can be seen by setting $\rho = 4\sqrt{3\frac{k}{\gamma}\ln(2/\epsilon)}$ for $\epsilon > 0$.



4.3 Discussion 

Choice of parameters Suppose we want to ensure $(k\frac{n}{m}, \gamma)$-privacy, where $k$ is a constant, i.e. $d = k\frac{n}{m}$ is the bound on the adversary's prior probability, and $\gamma$ is the bound on the posterior probability that is acceptable for us. We also want to ensure utility. Choosing $\alpha = \frac{1}{2}$ satisfies both Theorem 4.1 (assuming $d/\gamma \le 1/2$) and Theorem 4.2. Next, we choose $\beta = \frac{d}{\gamma}$. This satisfies Theorem 4.1 (since $1 \ge \frac{1-\gamma}{1-d}\,\alpha$), and also satisfies Theorem 4.2 whenever $\frac{d}{\gamma} \le \frac{r}{4}\cdot\frac{n}{m}$, i.e. $\frac{k}{\gamma} \le \frac{r}{4}$. In other words, with these choices of parameters we have $(d, \gamma)$-privacy, and a utility that is captured by $\rho = 2\sqrt{3r\ln(2/\epsilon)}$, where $r = \frac{4k}{\gamma}$ (smaller $r$'s are better for utility). The privacy/utility tradeoff is as follows: we want to protect against a powerful adversary, i.e. large $k$; we want to protect well, i.e. small $\gamma$; and this limits the utility expressed by $r$.

The View Size A concern about this algorithm is the potential size increase of the view: since most of the privacy comes from the newly inserted tuples (i.e. $\beta$), clearly $V$ will be larger than $I$. The expected size of $V$ is simply $\alpha n + (m - n)\beta$. For the chosen values of $\alpha = \frac{1}{2}$ and $\beta = \frac{d}{\gamma}$ the formula reduces to approximately $n\left(\frac{1}{2} + \frac{k}{\gamma}\right)$ (using $n \ll m$), which means a similar tradeoff exists for the view size. Fig. 1 shows the tradeoff by plotting the relationship between the size ratio of the view and $\gamma$ for a fixed $k = 5$. The size ratio of a view $V$ is simply $|V|/|I|$.
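The parameter choice and the resulting view size can be put into a short sketch. It assumes the reading of this section in which $\alpha = 1/2$, $\beta = d/\gamma$ and $r = 4k/\gamma$; the function names `choose_parameters`, `rho` and `expected_view_size` are hypothetical.

```python
import math

def choose_parameters(n, m, k, gamma):
    """Parameters for (k*n/m, gamma)-privacy: alpha = 1/2, beta = d/gamma."""
    d = k * n / m
    if d / gamma > 0.5:
        raise ValueError("need d/gamma <= 1/2 so that alpha = 1/2 is allowed")
    alpha = 0.5
    beta = d / gamma          # assumed choice of beta
    r = 4.0 * k / gamma       # assumed utility constant
    return alpha, beta, r

def rho(r, eps):
    """Error scale: the estimate is within rho * sqrt(n) w.p. >= 1 - eps."""
    return 2.0 * math.sqrt(3.0 * r * math.log(2.0 / eps))

def expected_view_size(n, m, alpha, beta):
    """Expected |V| = alpha * n + (m - n) * beta."""
    return alpha * n + (m - n) * beta
```

With these choices the size ratio `expected_view_size(n, m, alpha, beta) / n` is close to `0.5 + k / gamma`, so a smaller $\gamma$ (stronger privacy) directly inflates the published view.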

The Domain If $I$ is a table with $a$ attributes, then we chose as domain $D$ the cross product of the active domains of all attributes of $I$, i.e. $D = D_1 \times \ldots \times D_a$, where $D_i = \pi_i(I)$, $i = 1, \ldots, a$. This has two implications. First, the algorithm computes each $D_i$, and publishes each $D_i$ separately; each $m_i = |D_i|$ is relatively small, while the size of the entire domain $m = m_1 m_2 \cdots m_a$ is huge. The large size of the domain means that the estimator needs to compute $Q(D) = |Q \cap D|$ efficiently. This is easy, since $Q$ is usually a conjunction of predicates over several attributes, hence can be computed on each $D_i$ independently. Secondly, in the insertion phase of the algorithm tuples from the domain are added by first computing the number of tuples to be added using a binomial distribution with parameters $m - n$ and $\beta$. Then a sample of that size is drawn by randomly selecting tuples from the domain without replacement.
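Both implications can be illustrated without ever materializing $D$. The sketch below assumes a conjunctive query given as one predicate per attribute and draws the inserted tuples by rejection sampling; the number of tuples to insert (a Binomial($m - n$, $\beta$) draw in the text) is passed in by the caller. The names `domain_count` and `sample_insertions` are hypothetical.

```python
import random

def domain_count(active_domains, predicates):
    """|Q intersect D| for a conjunctive query, one predicate per attribute."""
    total = 1
    for values, pred in zip(active_domains, predicates):
        # Each attribute contributes the number of its values that qualify.
        total *= sum(1 for v in values if pred(v))
    return total

def sample_insertions(active_domains, instance, count, rng):
    """Draw `count` distinct tuples of D that are not in `instance`."""
    m = 1
    for values in active_domains:
        m *= len(values)
    if count > m - len(instance):
        raise ValueError("not enough tuples outside the instance")
    chosen = set()
    while len(chosen) < count:
        # A uniform tuple of D is one uniform value per attribute.
        t = tuple(rng.choice(values) for values in active_domains)
        if t not in instance:
            chosen.add(t)
    return chosen
```

Rejection sampling is cheap here because both the instance and the sample are tiny relative to $m$, so collisions are rare.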

A Case Study To understand the tradeoff between privacy and utility, we experimented with US census data in the Adults database, which has also been used for experiments in [8] and [1]. The database $I$ has 9 attributes and $n = 30162$ tuples. The product of the attributes' active domains has $m = 648023040$ tuples, thus $n/m \approx 4.65 \times 10^{-5}$. We chose the parameters $\alpha$ and $\beta$ to ensure $(10n/m, 0.2)$-privacy: i.e. the adversary's prior is bounded by $d \approx 4.65 \times 10^{-4}$ (about 0.0465%) and the posterior by $\gamma = 20\%$. The upper bound on the posterior, $\gamma$, as a theoretical upper bound, is for an arbitrarily biased adversary: in practice, the posterior is much smaller. We chose the algorithm's parameters as $\alpha = 0.5$ and $\beta = 9.5 \times 10^{-4}$.

Figure 2: $EST(Q, V)$ vs $Q(I)$

Next we tested the utility of this data, by running all possible selection queries with up to three attributes. Figure 2 shows the result. Each dot represents one query $Q$, where $x = Q(I)$ and $y = EST(Q, V)$: thus, a perfect estimator would correspond to the $y = x$ line. This experiment illustrates the desired behavior of our algorithm. For small values of $Q(I)$ the estimated values are far off (even negative): this is necessary to preserve privacy. But for large values of $Q(I)$ the estimates are much better and meaningful. As expected, most of the values lie in a band of width 1000 around the line $y = x$, thus showing that the error was indeed additive in nature.

5 Tuple correlations 

So far our analysis was restricted to tuple-independent adversaries. This restriction strengthens the impossibility 
result, and weakens that of the algorithm. Here we discuss several extensions to tuple correlations. First, we 
show that if the adversary knows arbitrary correlations, then no algorithm can achieve both privacy and utility. 
Then we examine a restricted class of correlations, which is sufficiently general to model join/link attacks: we 
show that our algorithm still guarantees privacy against positive leakages (utility, of course, is unchanged), but it cannot protect against negative leakage. The policy of protecting against positive leakage while permitting negative leakage is sometimes acceptable in practice, and it raises the question whether our algorithm could be strengthened in this case: after all, the result in Corollary 3.7 relies on both negative and positive leakages. We answer this negatively, by giving a variant of the impossibility result for positive leakages only.

5.1 Arbitrary Correlations 

Suppose we publish a view $V$ of a database of diseases with schema (age, zip, disease) using our algorithm. An attacker knows the age and zipcode for both Joe and Jim. Now Joe and Jim are brothers: if one has diabetes, then the other is quite likely to have diabetes as well (at least that's what the attacker believes). Suppose the attacker finds two tuples in $V$ matching both Joe and Jim having diabetes. The probability that neither was in $I$ and both were inserted by the algorithm is very small: $\beta^2$. In contrast, because of their strong correlation, the probability that they were in the instance $I$ is now much larger: $\Pr_1[t \cap t'] \gg \Pr_1[t]\Pr_1[t']$. Thus, upon seeing both tuples the attacker concludes that they are Joe and Jim's with high probability. This is the reason why our algorithm has difficulties hiding data when the prior has correlations.

We show here that even for $d > \frac{n}{m}$ no private and useful algorithm exists if the adversary is allowed to know arbitrary correlations. The proof appears in the appendix.

Theorem 5.1. If $A$ is a useful algorithm, then for any $\epsilon > 0$ there exists $n$ such that for every $d > \frac{n}{m}$ there is a $d$-bounded adversary and tuple $t$ such that $\Pr_1[t] \le d$ but $\Pr_{12}[t \mid V] \ge 1 - 2\epsilon$.

5.2 Exclusions 

We consider now some restricted forms of correlations: exactly one tuple from a set of tuples occurs in the database.

Definition 5.2. A $d$-exclusive adversary is a $d$-bounded adversary with the following kind of correlations among the tuples: there is a partition of $D$ into a family of disjoint sets, $D = \bigcup_j S_j$, s.t.

• Tuples are pairwise independent if they do not belong to the same set.

• Exactly one tuple in each set occurs in the database instance.

This type of correlation naturally models adversaries performing join/link attacks. Such attackers are able to determine some of the identifying attributes of a particular tuple, say the age and nationality of a person that they know must occur in the database. The adversary can thus identify a set of tuples $S_j$ of the domain such that exactly one tuple in the set belongs to the database.

5.2.1 Positive results for d-exclusive adversaries 

We have already seen the privacy analysis for the insert-remove algorithm against $d$-independent adversaries. In this section, we show that the algorithm provides a slightly weaker form of guarantee for $d$-exclusive adversaries. We show that the algorithm ensures that there is no positive leakage for negatively correlated $d$-bounded adversaries: that is, if $\Pr_1[t] \le d$ then $\Pr_{12}[t \mid V] \le \gamma$.

Theorem 5.3. If $\alpha = \frac{1}{2}$ and $\beta \ge \frac{d}{\gamma}\cdot\frac{1-\gamma}{1-d}$ then the algorithm ensures absence of positive leakage for all tuples and for all $d$-exclusive adversaries, i.e. if $\Pr_1[t] \le d$ then $\Pr_{12}[t \mid V] \le \gamma$.

The proof of the theorem appears in the appendix. The algorithm cannot guarantee absence of negative leakage for $d$-exclusive adversaries, as seen from the following example:

Example 5.4 Suppose we publish a view $V$ of a database of diseases with schema (age, zip, disease) using our algorithm. An attacker knows the age and zipcode of a person Joe and additionally knows that no other person in the database has the same combination of age and zipcode. The prior belief of the attacker is that Joe has exactly one disease but any one of the diseases is equally likely. Suppose the attacker finds a tuple $t_1$ in $V$ matching Joe and Diabetes. Let us consider the tuple $t_2$, corresponding to Joe and Malaria. Theorem 5.3 shows that there won't be a positive leakage for $t_1$. However, for $t_2$ there is a drastic drop in the a posteriori probability causing a negative leakage.
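The effect in Example 5.4 can be reproduced with a small Bayes computation. This is a sketch under the assumption that the attacker's set $S$ contains exactly one true tuple and that the view is produced by the insert-remove algorithm; the function name `exclusive_posteriors` and the numbers used with it are illustrative only.

```python
def exclusive_posteriors(priors, in_view, alpha, beta):
    """Posteriors over a set S in which exactly one tuple is in the database.

    priors[i]  : prior that tuple i is the true one (should sum to 1)
    in_view[i] : whether tuple i appears in the published view
    """
    weights = []
    for i, prior in enumerate(priors):
        likelihood = 1.0
        for j, seen in enumerate(in_view):
            if j == i:
                # Tuple i is in I: kept with probability alpha.
                likelihood *= alpha if seen else 1.0 - alpha
            else:
                # Tuple j is not in I: inserted with probability beta.
                likelihood *= beta if seen else 1.0 - beta
        weights.append(prior * likelihood)
    total = sum(weights)
    return [w / total for w in weights]
```

With ten equally likely diseases, only $t_1$ in the view, $\alpha = 0.5$ and $\beta = 0.12$, the posterior of $t_1$ rises while the posterior of every other tuple falls below its prior, which is the negative leakage the example describes.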

5.2.2 Impossibility results for d-exclusive adversaries 

In this section we extend the impossibility result for d-exclusive adversaries to accommodate the case when 
negative leakage is legally allowed and only positive leakage is considered as a privacy breach. The impossibility 
result shows that if an algorithm satisfies some form of weak utility then there exists a d-exclusive adversary, for 
$d = \Omega(n/\sqrt{m})$, which can infer that a certain tuple exists in the database with high probability.

The extension works only for a restricted class of randomized algorithms. However, the class is broad enough to encompass many of the privacy preserving algorithms in the literature. The class of algorithms which we consider satisfy the following bucketization assumptions:

• The algorithm is such that if the prior distribution is tuple independent then the posterior distribution is also tuple independent. More formally, the tuples in the domain can be partitioned into buckets such that the distribution $\Pr_{12}[t_i \mid V]$ over the tuples obeys the following condition: if two tuples $t_i$ and $t_j$ lie in different buckets then they are independent conditioned on the view.

• Let the number of buckets be $N_b$. We assume that the distribution of tuples, both of the database instance and of the domain, among the buckets is not too skewed. More formally, for every $k \ge 1$ there exists an $N_k \le N_b$ such that if we remove any $N_k$ buckets, the remaining ones still contain a fraction $1/k$ of the tuples of $I$ as well as of $D$.

For example consider the method of full domain generalization (in [7]) as applied to ensure $k$-anonymity or the anatomy method as applied to ensure $\ell$-diversity. For both the methods the tuples corresponding to the different anonymized groups form buckets which satisfy the assumptions above.

To show the impossibility result for $d$-independent adversaries, we used $\epsilon$-indistinguishability as the privacy definition and meaningfulness as the utility definition. The impossibility result for $d$-exclusive adversaries requires a slight modification to both privacy and utility definitions.

Definition 5.5. Consider all database instances $I$ and $I'$ which disagree exactly over a pair of tuples (i.e. $|I| = |I'|$ and $|I - I'| = 2$). An algorithm satisfies $\epsilon$-indistinguishability over a set $D'$ if, for all database instances $I$ and $I'$ which disagree exactly over a pair of tuples with both the tuples in $D'$, and for all views $V$,

\[ e^{-\epsilon} \le \frac{\Pr_2[V \mid I]}{\Pr_2[V \mid I']} \le e^{\epsilon} \]

Thus $\epsilon$-indistinguishability over a set $D'$ is a relaxation of the original $\epsilon$-indistinguishability definition. Next we consider the notion of utility called $k$-meaninglessness. The notion of utility is a generalization of meaninglessness and is also defined for counting queries $Q$ s.t. $\frac{m}{2}(1 - f) \le |Q| \le \frac{m}{2}(1 + f)$, for a constant $f < 1$. Intuitively, the definition tries to capture the fact that an algorithm will have bad utility if many queries $Q$ have an error of $O(n)$ in their estimates.

Definition 5.6. An algorithm is called k-meaningless, if there is a set S with \S\ < n(l — 4) such that the 
distributions Pr® and Pr'® have a statistical difference smaller than 1/2 for a fraction 2/3 of the queries Q. 
Pr'^: Pr 12 [V\E Q ] where E Q is the event Q{I) = \Q n S\ and \I\ = n 
Pr'®: Pr 12 [V\E' Q ] where E' Q is the event Q(I) = \Q n S\ + § and \I\ = n 

Note that for $k = 1$, the definition reduces to the notion of meaninglessness. $k$-meaninglessness follows the same
intuition: If SD(Pr'® , Pfg) is small then it would be impossible to distinguish whether the original answer is 
\Q(1 S\ or \Q(1 S\ + x thus resulting in an error of at least Cor. 15. 71 relates the notions of e-indistinguishability 
over the set $D'$ and $k$-meaninglessness; the proof appears in the appendix.

Corollary 5.7. Let $D'$ be a set of tuples such that $|D'| \ge \frac{m}{k}$ and $|D' \cap I| \ge \frac{n}{k}$, for some constant $k$ independent of $m$ and $n$. Then there exists a constant $c$ such that every algorithm which satisfies $\epsilon$-indistinguishability over the set $D'$ with $e^{\epsilon} \le c\frac{m}{n^2}$ is $k$-meaningless.

We call any algorithm which is not $k$-meaningless $k$-meaningful. As any algorithm which is meaningless is also $k$-meaningless, it implies that every $k$-meaningful algorithm is meaningful.

Theorem 5.8. Let A be k-meaningful algorithm which satisfies the bucketization assumptions with Nk > 
Then there exists a constant c independent of n and m for which there is a d- exclusive adversaries with d = 
max(-^,c-^=) having positive leakage on some tuple. 

The proof of Theorem 5.8 appears in the appendix. We give a brief overview outlining the intuition. We argue that for a $k$-meaningful algorithm which satisfies the bucketization assumptions, there exists a $d'$-independent adversary for which either there is a positive leakage on one tuple or there is negative leakage on a lot of tuples. The presence of multiple negative leakages is shown by using the fact that, for a $k$-meaningful algorithm, any large set containing a significant fraction of tuples of the database will have leakage on some tuple. Using the bucketization assumption we prove that if there is no positive leakage, then many buckets will contain tuples having negative leakage. In case of positive leakage we are done, as the $d'$-independent adversary serves as the $d$-exclusive adversary. In the case of multiple negative leakages, we explicitly construct a $d$-exclusive adversary using the $d'$-independent adversary such that there is positive leakage for at least one tuple. Here, we use $d' = d/3$.

6 Conclusions



We have described a formal framework for studying both the privacy and the utility of an anonymization 
algorithm. We proved a tight bound between privacy and utility, based on the attacker's power. For the case 
where privacy/utility can be guaranteed, we have described a new, quite simple anonymization algorithm, based 
on random insertions and deletions of tuples in/from the database. We have done a limited empirical study, and 
saw a good privacy/utility tradeoff. Our algorithm increases the size of the data, but by tolerable amounts (a 
factor of 10, in our empirical study). It will be interesting to study in future work ways to reduce the size of the 
published view.

A Relationships among privacy definitions

A.1 $(d, \gamma)$-privacy and $(d, \delta)$-relative-privacy

Proposition (2.5). Every $(d, \delta)$-relative-private algorithm is $(d, \gamma)$-private for $\gamma = d e^{\delta}$. Conversely, every $(d, \gamma)$-private algorithm is $(d, \delta)$-relative-private for

\[ e^{\delta} = \frac{\gamma}{d}\cdot\frac{1-d}{1-\gamma} \]



Proof: The proof of the first part trivially follows from the definitions. For the converse let $A$ be a $(d, \gamma)$-private algorithm. Thus $A$ ensures no positive/negative leakage for all $d$-independent adversaries. We shall use this fact to show that $A$ should satisfy $(d, \delta)$-relative-privacy. For this we consider $\Pr_{12}[t_i \mid V]$ purely as a function of $\Pr_1[t_i]$ while keeping the prior probability of all other tuples constant. Let us call the function $f$. We compute the function explicitly as:

\[ \Pr_{12}[t_i \mid V] = \frac{\Pr_{12}[V \cap t_i]}{\Pr_{12}[V \cap t_i] + \Pr_{12}[V \cap \bar t_i]} = \frac{\sum_{I_j \in \mathcal{I}_t} \Pr_2[V \mid I_j]\Pr_1[I_j]}{\sum_{I_j \in \mathcal{I}_t} \Pr_2[V \mid I_j]\Pr_1[I_j] + \sum_{I_j \in \mathcal{I}_{\bar t}} \Pr_2[V \mid I_j]\Pr_1[I_j]} \]

Here, $\mathcal{I}_t$ represents the set of instances which contain $t_i$ and $\mathcal{I}_{\bar t}$ represents the set of instances which do not. For each instance $I_j$ in the set $\mathcal{I}_t$ we can decompose $\Pr_1[I_j]$ as $\Pr_1[t_i]\Pr_1[T_j]$ where $T_j$ is the event that denotes that only the tuples in $I_j$ except for $t_i$ occur in the database. This decomposition is possible because of tuple independence in $\Pr_1$. Using this decomposition we can rewrite $\Pr_{12}[t_i \mid V]$ as

\[ \frac{c_1\Pr_1[t_i]}{c_1\Pr_1[t_i] + c_2(1 - \Pr_1[t_i])} \]

where $c_1$ and $c_2$ are constants as we vary $\Pr_1[t_i]$ and keep the prior probabilities of all other tuples constant. Let us represent $\Pr_{12}[t_i \mid V]$ as the function

\[ f(x) = \frac{c_1 x}{c_1 x + c_2(1 - x)} \]

We notice that the function is increasing in $x$ and has slope $\frac{c_1}{c_2}$ at the origin. Additionally, if $c_1 \ge c_2$ then $f(x) \ge x$ for all $x$. On the other hand if $c_1 \le c_2$ then $f(x) \le x$ for all $x$. We are interested in the maximum and minimum possible values of $\frac{f(x)}{x}$. The maximum occurs when $c_1 \ge c_2$ at points near the origin and is equal to $\frac{c_1}{c_2}$.

The fact that $A$ has to be safe from privacy leakages imposes certain restrictions on $f$. One condition is that $f(d) \le \gamma$, which implies that $\frac{c_1}{c_2} \le \frac{\gamma(1-d)}{d(1-\gamma)}$. Another is that $\forall x: \frac{f(x)}{x} \ge \frac{d}{\gamma}$. It follows that such an algorithm $A$ is also $(d, \delta)$-relative-private for $e^{\delta} = \max\left(\frac{\gamma(1-d)}{d(1-\gamma)}, \frac{\gamma}{d}\right) = \frac{\gamma(1-d)}{d(1-\gamma)}$.
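As a worked step, the bound on $c_1/c_2$ used in the proof can be obtained by rearranging the condition $f(d) \le \gamma$. The derivation below is spelled out from the definition of $f$ and assumes $c_1, c_2 > 0$ and $0 < d, \gamma < 1$; it is an illustration of the step rather than text from the original proof.

```latex
\begin{align*}
f(d) = \frac{c_1 d}{c_1 d + c_2 (1-d)} \le \gamma
  &\iff c_1 d \le \gamma\, c_1 d + \gamma\, c_2 (1-d) \\
  &\iff c_1 d\,(1-\gamma) \le \gamma\, c_2 (1-d) \\
  &\iff \frac{c_1}{c_2} \le \frac{\gamma\,(1-d)}{d\,(1-\gamma)} .
\end{align*}
```

Since $d \le \gamma$ gives $\frac{1-d}{1-\gamma} \ge 1$, this quantity is at least $\frac{\gamma}{d}$, which is why the maximum in the last line of the proof is attained by the first term.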

A.2 $(d, \delta)$-relative-privacy and $\epsilon$-indistinguishability

For proving Proposition 3.5 we need to show the following lemma.

Lemma A.1. Let $A$ be a $(d, \delta)$-relative-private algorithm. Then, for all tuples $t_i$ such that $\Pr_1[t_i] \le d$,

\[ \frac{1 - d e^{\delta}}{1 - d} \le \frac{\Pr_{12}[\bar t_i \mid V]}{\Pr_1[\bar t_i]} \le \frac{1 - d e^{-\delta}}{1 - d} \qquad (1) \]

Proof: Assume $\Pr_1[t_i] \le d$. Additionally, we know that

\[ e^{-\delta} \le \frac{\Pr_{12}[t_i \mid V]}{\Pr_1[t_i]} \le e^{\delta} \]

It is easy to see that using the above two inequalities we can bound $\frac{\Pr_{12}[\bar t_i \mid V]}{\Pr_1[\bar t_i]}$ as required. This completes the proof.

Using the sandwich theorem on (1), we note that

\[ \lim_{d \to 0} \frac{\Pr_{12}[\bar t_i \mid V]}{\Pr_1[\bar t_i]} = 1 \]

Thus, for any given $\delta$, we can choose $d$ small enough such that $\frac{1}{2} \le \frac{\Pr_{12}[\bar t_i \mid V]}{\Pr_1[\bar t_i]} \le 2$.



Proposition (3.5). Every $(d, \delta)$-relative-private algorithm satisfies $\epsilon$-indistinguishability with $\epsilon = 2\delta + 2\ln(2)$.

Proof: Consider a $d'$ for which $\Pr'_1$ is such that

\[ \frac{1}{2} \le \frac{\Pr'_{12}[\bar t_i \mid V]}{\Pr'_1[\bar t_i]} \le 2 \qquad (2) \]

We know that such a $d'$ exists for every $\delta$. If $d \le d'$ then we let $\Pr'_1 = \Pr_1$. On the other hand if $d > d'$, then we use the fact that any $(d, \delta)$-relative-private algorithm is also a $(d', \delta)$-relative-private algorithm, and hence from here on we consider the algorithm as $(d', \delta)$-relative-private and assume that equation (2) holds true for $\Pr_1$.

Consider any two database instances $I$ and $I''$ which differ in exactly one tuple, with $I$ containing one extra tuple $t_1$. Then we can represent $I$ as $t_1 \cap T$ and $I''$ as $\bar t_1 \cap T$. Here $T$ represents the event that all tuples in $I''$ belong to the database and all tuples not in $I$ do not belong to the database.

\[ \frac{\Pr_2[V \mid I]}{\Pr_2[V \mid I'']} = \frac{\Pr_2[V \mid (t_1 \cap T)]}{\Pr_2[V \mid (\bar t_1 \cap T)]} = \frac{\Pr_{12}[V \cap t_1 \cap T]\,\Pr_1[\bar t_1 \cap T]}{\Pr_{12}[V \cap \bar t_1 \cap T]\,\Pr_1[t_1 \cap T]} = \frac{\Pr_{12}[t_1 \mid (V \cap T)]\,\Pr_1[\bar t_1]}{\Pr_{12}[\bar t_1 \mid (V \cap T)]\,\Pr_1[t_1]} \]

Consider a $d$-independent adversary for which $\Pr_1[T] = 1$ but $\Pr_1[t_1] \le d$. For this adversary the ratio above equals $\frac{\Pr_{12}[t_1 \mid V]}{\Pr_1[t_1]}\cdot\frac{\Pr_1[\bar t_1]}{\Pr_{12}[\bar t_1 \mid V]}$. As the algorithm satisfies $(d, \delta)$-relative-privacy, it should be safe against such an adversary. Thus,

\[ \frac{e^{-\delta}}{2} \le \frac{\Pr_2[V \mid I]}{\Pr_2[V \mid I'']} \le 2e^{\delta} \]

Similarly consider the database instances $I'$ and $I''$ such that $I'$ has one extra tuple $t_2$. Again, we observe that

\[ \frac{e^{-\delta}}{2} \le \frac{\Pr_2[V \mid I']}{\Pr_2[V \mid I'']} \le 2e^{\delta} \]

Thus, combining the two inequalities we can see that there exists an $\epsilon = 2(\delta + \ln(2))$ such that

\[ e^{-\epsilon} \le \frac{\Pr_2[V \mid I']}{\Pr_2[V \mid I]} \le e^{\epsilon} \]

This completes the proof.



B The strong Impossibility result 

Lemma B.l. z/e e < then Var Q (Pr 12 [V\E Q }) < &e ' Pr ^ n2 
Proof: Let $x$ denote a database instance of size $n$ and let $p(x)$ denote $\Pr_{12}[V \mid x]$. Thus, $\Pr_{12}[V]$ is exactly
V xP (x) = E K 7^- Additionally, Pr 12 [V\E Q ] = E xe s Q p(x) = E. e s Q f^y- Note that E Q (Pr 12 [U|P Q ]) = 

Pria[V]. We want to compute VarQ(Pri2 [VI-Eq]). Let Xq( x ) denote the indicator variable corresponding to 
the event x € Sq. 



Va.v Q {Pr 12 [V\E Q ]) = Eq( ^^ m ^ v '- )'-{ 



ri 



m-n\( n \ ( m n T ) 



1 \ - p(x)p(y) 

2-~> ( m )( m 7 n )( n ) V i J\n-i 

\ n I i=0 \x-y\=2i V n I \ i I \n-%) |_ 
1 ™ 

V n / i=o 

1 ™ 
= TmrV 

V n / t=0 

Here |x — y| is the symmetrical difference between the instances x and y. Thus \x — y\ represents the number 
of tuples in which x and y disagree. We use the notation m = J2\ x -y\=2i ^™^m-n^ s p c i = ( m7 ~i~") („" j) > 

* = ("7") j^y and k = ci - *■ 

Note that E, ^ = 0- Moreover we can rewrite b{ as the product of ( m i "J ( n ™ J and v ^, 71 !_„^ — . We can see 
that the first factor is always positive while the second factor decreases as i increases. Thus bi < ==>• < 

p(y) 



For the sequence di it is easy to see that Y) 4 - i J^ 2 - = ^ff? 2 = Pri 2 fVl 2 . Due to property of 

UJ (J 

e-indistinguishability we know that if \x — y\ — 1 then e~ e < < e e . Using this we show in Lemma TB.2I that 



Vi, e~ e < -2*- < e e 

— a;+l — 



Let i be the largest index s.t. 6 4o > 0. Then b l0+l < and thus tj%£y- < ^±yX As EjLo a ^ " 
Pri2[U] 2 , we know that in particular 

/ m—n\ { n \ 
\i + l) \n-ig-l) ^ p rrri2 
a ^0 + l T^A S P»"l2 [V \ 

fin—n\ ( n \ 

~ — o — - Pri2[v] 

( TUT — 7l\ f Th \ 

v. a «o V i +l /\»-io-l/ < Pn 2 [y] 2 



/mr—n 
\ i 



C» /2 ) 

) L-i ) ( mr -n-i )(n- i ) 



10 (Z) (*o + l) 2 

(T)(A) ^VPr 12 [V] 2 

(T) " "»T-2n 
a io c io < 2n 2 e £ Pr 12 [y] 2 



<e e Pr 12 [y] : 



hit — 2n 



If e e < mT 2r fi n then a^Ci < ai+1 2 Ci+1 for all i. Thus the entire sum j^kr^ E™=o a ^ )i can ^ e DOun ded as Y^iLo a i°i — 
2 y,'° < 4 " e - p, * 12 [^] , As m is much larger than n and r is a constant, we can assume that mr > 4n. Thus we 
get that Var Q (Pri2[U|P Q ]) < 8 " V ^ r 12[Vl2 . This completes the proof. 
Lemma B.2. $\forall i,\ e^{-\epsilon} \le \frac{a_i}{a_{i+1}} \le e^{\epsilon}$.

Proof: Let Sf be the set of instances {y | \x — y\ = 2i} 



E 

|x-j/|=2j 



p(x)p(y) 



D( m 7")(n- < ) 



E 



p(y) 



(m—n\ ( n 



For each y, consider the set of instances S y obtained by removing one of the i tuples in y by one of the i tuples 
of x. Hence there are i 2 such instances. Note that Vy' G S y , p(y') < e e p(y). On the other hand each instance 
y' G U ye s*Sy is obtained from exactly (n — i + l)(m — n — i + 1) instances from Sf. Thus 



p(ac) 



E .... 

X V Tl 



> 



E 



p(x) 



E 
E 



(n — i + l)(m — n — i + 1) p(y') 



Aim 



P(y') 



— n\ I n 
i J \n—i 



: /m—n\ / n \ 
' V i-1 / Vn-i+lJ 



ai-i 



Similarly we can show that < e e ai_i. This completes the proof 

Lemma B.3. With probability at least (1 - a), SD(Pr[z], Pr s [z}) < O (j^j 
Proof: The proof is exactly the same as in the proof of Lemma 2 shown in [4]. 



C Tuple correlations 

C.1 Arbitrary correlations

Theorem (5.1). If $A$ is a useful algorithm, then for any $\epsilon > 0$ there exists $n$ such that for every $d > \frac{n}{m}$ there is a $d$-bounded adversary and tuple $t$ such that $\Pr_1[t] \le d$ but $\Pr_{12}[t \mid V] \ge 1 - 2\epsilon$.

Proof: Let $I$ be the input instance and $n = |I|$. Consider a $d$-bounded adversary with the following tuple correlations: the adversary knows that either all tuples of the set $S = I$ belong to the database or none of the tuples belong to the database. Consider the query $Q = S$. As the algorithm is useful, for every $\epsilon$ there exists a $\rho$ such that $|EST(Q, V) - Q(I)| \le \rho\sqrt{n}$ with probability greater than $1 - \epsilon$. If $n$ is large enough, then it follows that the algorithm cannot output the same view on $I$ and any instance $I'$ in the set $\{I' \mid S \subseteq D - I'\}$ with probability greater than $2\epsilon$. Hence, from the view there will be a breach with probability greater than $1 - 2\epsilon$.

C.2 $d$-exclusive adversaries

C.2.1 Positive result for $d$-exclusive adversaries

Theorem (5.3). If $\alpha = \frac{1}{2}$ and $\beta \ge \frac{d}{\gamma}\cdot\frac{1-\gamma}{1-d}$ then the insert-remove algorithm ensures absence of positive leakage for all tuples and for all $d$-exclusive adversaries, i.e. if $\Pr_1[t] \le d$ then $\Pr_{12}[t \mid V] \le \gamma$.

Proof: Let $S$ be the set of the partition that contains $t_i$. The probability $\Pr[t_i \mid V]$ is only dependent on which tuples in $S$ appear in $V$. It is clear that maximum privacy leakage happens when $V$ contains $t_i$ but does not contain any other tuple from $S$. In that case, $\Pr[V \mid t_i] = \alpha(1-\beta)^{|S|-1}$ and $\Pr[V \mid t_j] = \beta(1-\alpha)(1-\beta)^{|S|-2}$. Also, let $\Pr[t_i] = p$. For a $d$-exclusive adversary $\bar t_i = \bigcup_{j \ne i} t_j$. Thus we get,

\[ \Pr[t_i \mid V] = \frac{\alpha(1-\beta)\,p}{\alpha(1-\beta)\,p + \beta(1-\alpha)(1-p)} \]

As $p \le d$, $\alpha = \frac{1}{2}$ and $\beta \ge \frac{d}{\gamma}\cdot\frac{1-\gamma}{1-d}$, we get $\Pr[t_i \mid V] \le \gamma$. This completes the proof.



C.3 Impossibility result for $d$-exclusive adversaries

Corollary (5.7). Let $D'$ be a set of tuples such that $|D'| \ge \frac{m}{k}$ and $|D' \cap I| \ge \frac{n}{k}$, for some constant $k$ independent of $m$ and $n$. Then there exists a constant $c$ such that every algorithm which satisfies $\epsilon$-indistinguishability over the set $D'$ with $e^{\epsilon} \le c\frac{m}{n^2}$ is $k$-meaningless.



Proof: We can think of a random Q with |Q| 



_ m(l-f) 



as picking a random subset of size 



m(l- 



from D. Let X 

|g'|(i-/) 
2 



stand for the random variable denoting the number of tuples in D' which are also in Q. Then E(A) 

and Var(A) = l^l^Zl. If \ D '\ > m thcn 

using Chcrnoff inequality we can show that with probability greater 

1 — 2e~iS? the random subset will have size between m ^l~^ and 3m ^~f) m £)'. 



2k 



2k 



Let I' = D' f] I, then using Theorem 13.41 for tuples in /' we can show that with probability greater than |, 



the statistical difference between Distribution 0' and Distribution 1' is at most O 
dominated by O f^ - ) • 

Distribution 0': Pn 2 [V\E] where E is the event Q(I') = 
Distribution 1': Pr 12 [V\E'] where E' is the event Q(I') = \ f\ 

This shows the distributions Pr 



2e~ 



fc 2 which is 



and Prff have statistical difference O 



|, where 



1 

(^m - ) 3 W ^ n Probability greater than 



Pr'%: Pr 12 [V\E Q ] where E Q is the event Q{I) = \Q n J'| 
Pr^: Pri 2 [y|Py where P^ is the event Q(7) = |Q nP| + f 
If e e = 0( I %) : then v4 is fc-meaningless. 



Theorem (|5.8p . Let A be k-meaningful algorithm which satisfies the bucketization assumptions with > |. 
Then there exists a constant c independent of n and m for which there is a d- exclusive adversary with d = 
max(-^ , c-j=) having positive leakage on some tuple. 

Proof: As $A$ is $k$-meaningful, it is also meaningful. By Corollary 3.7 we know that there exists a $d'$-independent adversary $Ad_1$ which has either a positive leakage or a negative leakage. Here we use $d'$ such that $d' = d/3$. If it has positive leakage then that adversary serves as the required $d$-exclusive adversary. On the other hand, if it has negative leakage on tuple $t_1$ we choose the bucket $B_1$ which contains that tuple. The set $D' = D - B_1$ is thus the set of tuples which are conditionally independent of tuple $t_1$.

Let us define $I' = I \cap D'$. We know from the bucketization assumptions that $|D'| \ge \frac{m}{k}$ and $|I'| \ge \frac{n}{k}$. Let us define $e^{\epsilon} = \max \frac{\Pr_2[V \mid I_1]}{\Pr_2[V \mid I_2]}$, where the maximum is over all instances $I_1$ and $I_2$ which disagree over a single pair of tuples with both of them in $D'$. As $A$ is $k$-meaningful, from Corollary 5.7 we know that $e^{\epsilon} = \Omega(m/n^2)$. By restricting over the set $D'$ and using Corollary 3.7 it follows that there exists a $d'$-independent adversary $Ad'_2$ for which there is a leakage for some tuple $t_2$ in the set $D'$. If there is positive leakage for $t_2$ then $Ad'_2$ serves as the required $d$-exclusive adversary. If there is negative leakage on $t_2$ then let the bucket which contains it be $B_2$. Consider the adversary $Ad_2$ which has the prior of $Ad'_2$ for tuples in $B_2$ and the prior of $Ad_1$ for tuples in $B_1$. As the posterior probabilities of tuples in $B_2$ are independent of those of tuples in $B_1$ even conditioned on $V$, the leakages for $Ad'_2$ in $B_2$ and for $Ad_1$ in $B_1$ would be preserved for $Ad_2$. Thus, $Ad_2$ has negative leakage for both tuples $t_1$ and $t_2$.

We repeat this procedure § times, with buckets being B, for i e [1, §]. Define B = UP,: and D B = D - B. 
As Nk > 2 1 by the bucketization assumptions on the algorithm, it follows \Db\ > tt and \Db H I\ > r- Hence, 
Corollary 5.7 holds for each step and we can show that there is either a $d'$-independent adversary for which
there is positive leakage on some tuple or there exists a d'-independent adversary Ad for which there is negative 
leakage on at least i tuples. In the former case, we have the required d-exclusive (i-bounded adversary. For 
the latter case, let the set of tuples with negative leakage be $S'$. We know that each tuple in $S'$ belongs to a different bucket. Thus, we can increase the prior probability of each tuple $t_i$ in $S'$ so that $\Pr_1[t_i] = d'$. The negative leakage on each $t_i \in S'$ is still preserved because, as shown in the proof of Proposition 2.5, the ratio $\frac{\Pr_{12}[t_i \mid V]}{\Pr_1[t_i]}$ increases by a factor of at most $\frac{1}{1-d'}$ if we increase $\Pr_1[t_i]$ to $d'$ while keeping the prior probability of all other tuples constant. For $d' \le \frac{1}{2}$, the increase in the ratio is by a factor of at most 2, thus preserving the negative leakage.

Additionally there exists at least one tuple for which the prior Pr\ of Ad is such that Pri2[tp|y] = cdl . This 
is because $D_B$ contains at least $\frac{m}{k}$ tuples and $Ad$ can have any prior probabilities for each one of them, as they will never affect the posterior probabilities of the leaking tuples. So, we can change the prior of any tuple in $D'$ so as to get the required posterior probability for $t_p$. This is shown in Lemma C.1, proved in the appendix. Let us define $S$ as $S' \cup \{t_p\}$.

Using Ad, we construct a d-exclusive adversary Ad' having positive leakage on tuple $t_p$. Ad' corresponds to
the adversary who knows that exactly one tuple in $S$ occurs in the database instance. We construct Ad' from Ad
by setting the prior probability of each illegal instance (an instance that does not contain exactly one tuple from $S$) to 0.
We do this by distributing the probability of each such illegal instance over the entire set of legal instances, such
that the probability of every legal instance increases by an amount proportional to its original probability. Let
$L$ denote the set of legal instances and $IL$ denote the set of illegal instances.
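
The redistribution step above amounts to conditioning the prior on the event that an instance is legal. A minimal sketch on a toy domain follows; the tuple names, the prior values, and the helper functions `independent_prior`, `restrict_to_legal`, and `marginal` are all hypothetical and chosen only for illustration, and the starting prior is assumed to be tuple-independent.

```python
from itertools import combinations
from math import isclose, prod


def independent_prior(priors):
    """Probability of every instance (a frozenset of tuples) under a tuple-independent prior."""
    names = list(priors)
    dist = {}
    for size in range(len(names) + 1):
        for chosen in combinations(names, size):
            inst = frozenset(chosen)
            dist[inst] = prod(priors[t] if t in inst else 1 - priors[t] for t in names)
    return dist


def restrict_to_legal(dist, S):
    """Drop illegal instances and scale every legal one by 1 / (1 - r)."""
    legal = {inst: p for inst, p in dist.items() if len(inst & S) == 1}
    r = 1 - sum(legal.values())  # total prior mass of the illegal instances
    return {inst: p / (1 - r) for inst, p in legal.items()}, r


def marginal(dist, t):
    """Probability that tuple t occurs in the instance."""
    return sum(p for inst, p in dist.items() if t in inst)


# Toy data (assumed): three tuples in S and two tuples outside S.
priors = {"a": 0.2, "b": 0.2, "c": 0.2, "x": 0.1, "y": 0.3}
S = frozenset({"a", "b", "c"})

ad = independent_prior(priors)            # plays the role of Ad
ad_prime, r = restrict_to_legal(ad, S)    # plays the role of Ad'
```

Each legal instance keeps its relative weight, so its probability grows by the same factor 1 / (1 - r), and the marginal of any tuple in `S` under `ad_prime` is at most its original prior times that factor.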

For Ad, let $r = \sum_{I' \in IL} \Pr_1^{Ad}[I']$. We know that $r = 1 - (1 - d')^{\ldots} \le 1 - \frac{1}{e}$. In Ad' this sum gets
distributed over the legal instances, and thus $\Pr_1^{Ad'}[I] = \Pr_1^{Ad}[I] \cdot \frac{1}{1-r}$ for every legal instance $I$. We know that
$\Pr_1^{Ad'}[t_i] \le \Pr_1^{Ad}[t_i] \cdot \frac{1}{1-r} \le d'e \le d$. Thus Ad' is d-bounded. Lemma C.2, proved in the appendix, shows that
the constructed adversary Ad' is indeed d-exclusive. We next show that there is a positive leakage on tuple $t_p$
for Ad'. Let us compute, for any $t_i$ in $S'$ and $t_p$, the ratio

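$$\frac{\Pr_{12}^{Ad'}[t_i \mid V]}{\Pr_{12}^{Ad'}[t_p \mid V]} = \frac{\sum_{I \ni t_i} \Pr_2[V \mid I]\,\Pr_1^{Ad'}[I]}{\sum_{I' \ni t_p} \Pr_2[V \mid I']\,\Pr_1^{Ad'}[I']}$$

$$= \frac{\sum_{I \ni t_i} \Pr_2[V \mid I]\,\Pr_1^{Ad}[I]\,\frac{1}{1-r}}{\sum_{I' \ni t_p} \Pr_2[V \mid I']\,\Pr_1^{Ad}[I']\,\frac{1}{1-r}} \qquad (3)$$

$$\le \frac{\sum_{I \ni t_i} \Pr_2[V \mid I]\,\Pr_1^{Ad}[I]}{\sum_{I' \ni t_p} \Pr_2[V \mid I']\,\Pr_1^{Ad}[I']} = \frac{\Pr_{12}^{Ad}[t_i \mid V]}{\Pr_{12}^{Ad}[t_p \mid V]} \qquad (4)$$

$$\le \frac{\ldots}{cd} \qquad (5)$$

$$\le \ldots \qquad (6)$$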



Note that (4) can be derived from (3), as both summations are taken over a subset of the legal instances.
Also, (5) holds because there is negative leakage for $t_i$ and $\Pr_{12}^{Ad}[t_p \mid V] = cd$. Note that (6) holds for all $t_i$ in $S'$.
Moreover, we know that

$$\sum_{t_i \in S'} \Pr_{12}^{Ad'}[t_i \mid V] + \Pr_{12}^{Ad'}[t_p \mid V] = 1$$

Thus $\Pr_{12}[t_p \mid V]$ is at least $\ldots$. Hence there exists a d-exclusive adversary for which there is positive
leakage on at least one tuple. This completes the proof.

Lemma C.1. Let A be a k-meaningful algorithm. Let $D' \subseteq D$ be a set of tuples such that $|D'| > \ldots$ and
$|D' \cap I| > \ldots$. Then there exists a tuple $t$ in $D'$ such that $\Pr^u_{12}[t \mid V] > \ldots$ for the uniform tuple-independent
distribution $\Pr^u$ and some constant d independent of n and m.

Proof: We know that for any query Q,

$$EST(Q, V)_D = \sum_{t_i \in D} \Pr^u_{12}[t_i \mid V]$$

If we restrict EST(Q, V) to the tuples in $D'$, then

$$EST(Q, V)_{D'} = \sum_{t_i \in D'} \Pr^u_{12}[t_i \mid V]$$

Let us define I' = D' n /. For the algorithm to have some utility over domain D' , there exists a constant I 
with I > k for which some query Q has the following properties: \Q fl I'\ > r and EST(Q, V)d' > f- If n °t 
then it would be impossible to distinguish whether Q(I) > r or Q(I) < j from the view for every query Q. 
As EST{Q, V)d> = Eti££>' Pr i2N^] > f • Thi s means, there exists a tuple t in £>' such that Prf 2 [t\V] > ^. 
Hence proved. 

Note that P p i ?ij[|^ = y . As shown in the proof of Proposition 12.51 increasing the prior probability to d will 

change the ratio of posterior to prior probability to < ^. Thus, for the tuple t there exist Pr\ such that 
Pri 2 [t\V]>cd. 



Lemma C.2. The constructed adversary Ad' is d-exclusive.

Proof: We show that Ad' is a d-exclusive adversary by first showing that any two tuples $t_1$ and $t_2$ which are not in $S$
are still independent. Let us define $E$ as the event that exactly one tuple in $S$ occurs in the database. We know
that

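$$\Pr_1^{Ad}[t_1 \cap t_2 \cap E] = \frac{\Pr_1^{Ad}[t_1 \cap E]\,\Pr_1^{Ad}[t_2 \cap E]}{\Pr_1^{Ad}[E]} = \frac{\left(\sum_{I \in L,\, I \ni t_1} \Pr_1^{Ad}[I]\right)\left(\sum_{I \in L,\, I \ni t_2} \Pr_1^{Ad}[I]\right)}{1-r}$$

$$= (1-r)\,\frac{\Pr_1^{Ad'}[t_1 \cap E]\,\Pr_1^{Ad'}[t_2 \cap E]}{\Pr_1^{Ad'}[E]}$$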

Additionally, we can also compute $\Pr_1^{Ad}[t_1 \cap t_2 \cap E]$ as

$$\Pr_1^{Ad}[t_1 \cap t_2 \cap E] = \sum_{I \in L,\, I \ni t_1, t_2} \Pr_1^{Ad}[I] = (1-r) \sum_{I \in L,\, I \ni t_1, t_2} \Pr_1^{Ad'}[I] = (1-r)\,\Pr_1^{Ad'}[t_1 \cap t_2 \cap E]$$

As $\Pr_1^{Ad'}[E] = 1$, we get $\Pr_1^{Ad'}[t_1 \cap t_2] = \Pr_1^{Ad'}[t_1]\,\Pr_1^{Ad'}[t_2]$. Also, the prior of Ad' is such that $\sum_{t \in S} \Pr_1^{Ad'}[t] = 1$.
Thus the adversary is indeed d-exclusive.
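
The two properties used in this proof can be checked exactly on a small example. The sketch below is illustrative only: the tuple names and prior values are made up, the starting prior is assumed to be tuple-independent, and instances are encoded as 0/1 vectors. Exact rational arithmetic is used so that the equalities hold without rounding.

```python
from fractions import Fraction
from itertools import product

# Assumed toy priors: s1, s2, s3 form the set S; t1 and t2 lie outside S.
priors = {
    "s1": Fraction(1, 4),
    "s2": Fraction(1, 4),
    "s3": Fraction(1, 10),
    "t1": Fraction(1, 5),
    "t2": Fraction(1, 3),
}
S = {"s1", "s2", "s3"}
names = list(priors)


def prob_independent(bits):
    """Probability of one instance (a 0/1 vector over names) under the independent prior."""
    p = Fraction(1)
    for name, present in zip(names, bits):
        p *= priors[name] if present else 1 - priors[name]
    return p


# Keep only legal instances: exactly one tuple of S is present (the event E).
legal = {}
for bits in product([0, 1], repeat=len(names)):
    if sum(b for name, b in zip(names, bits) if name in S) == 1:
        legal[bits] = prob_independent(bits)

mass = sum(legal.values())  # equals 1 - r
ad_prime = {bits: p / mass for bits, p in legal.items()}


def pr(*wanted):
    """Probability under the conditioned prior that all the named tuples are present."""
    idx = [names.index(w) for w in wanted]
    return sum(p for bits, p in ad_prime.items() if all(bits[i] for i in idx))


independent_outside_S = pr("t1", "t2") == pr("t1") * pr("t2")
priors_over_S_sum_to_one = sum(pr(s) for s in S) == 1
exclusive_inside_S = pr("s1", "s2") == 0
```

Tuples outside `S` stay independent because the event E depends only on the tuples in `S`, while the tuples inside `S` become mutually exclusive and their marginals sum to 1.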